



United States 
Department of 
Agriculture 


Forest Service 


Northeastern Forest 
Experiment Station 


Research Note NE-345 


Abstract 


Adequately predicting the sampling errors of tabular data 
can reduce printing costs by eliminating the need to publish 
separate sampling error tables. Two generalized variance 
functions (GVFs) found in the literature and three GVFs 
derived for this study were evaluated for their ability to 
predict the sampling error of tabular forestry estimates. The 
recommended GVFs for most tables are either a GVF which 
incorporated the sampling errors of the row and column 
totals or a nonlinear GVF when the sampling errors are not 
published. Tables composed with one sampling intensity 
and containing data from a multinomial distribution can be 
represented by a simple linear estimator. 


Large surveys such as those conducted by USDA Forest 
Inventory and Analysis (FIA) Projects generate large 
amounts of information displayed in tabular form. Most 
often, the tables are published with few indications of the 
associated precision of the estimate. The lack of some 
measure of precision is not an oversight but is due to costs 
of publishing twice as many tables and the desire to 
maximize readability of the report. If a measure of precision 
is presented, it is often the sampling error in percent: 


$$SE(T_{ij}) = 100\,[\mathrm{var}(T_{ij})]^{0.5} / T_{ij}$$

where $T_{ij}$ is the cell estimate and $\mathrm{var}(T_{ij})$ is the sample
variance of the estimate. The sampling error is in the form
of a simple approximation formula (Hines and Vissage,
1988) or a summary table of selected sampling errors. The
sampling errors selected for publication are usually the 
errors of the table totals or subtotals. These can 
inadvertently lead to the assumption that the sampling 
errors of the individual cells within the table are of a similar 
magnitude. While this may be true, it is more common that 
the sampling error of a row or table total is smaller than the 
sampling error of an individual cell estimate. The cell 
sampling error can be many times greater than the 
sampling error of its row or column total. The objective of 
this study is to present GVFs as a cost-effective alternative 
to printing sampling error tables, and to increase the 
amount of information available in publications that 
currently do not present sampling errors for each cell in a
table.
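As a small illustration of the sampling error in percent defined above (100 times the square root of the variance of an estimate, divided by the estimate), the following sketch may help. The function name and the numbers are hypothetical and are not taken from the Kentucky inventory.

```python
import math


def sampling_error_percent(estimate, variance):
    """Sampling error in percent: 100 * sqrt(var(T)) / T."""
    if estimate <= 0:
        raise ValueError("estimate must be positive")
    return 100.0 * math.sqrt(variance) / estimate


# Hypothetical cell: an estimate of 2,500 acres with a variance of 62,500.
print(sampling_error_percent(2500.0, 62500.0))  # 10.0
```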


Methods 
Models 


The models investigated were selected from the literature or 
derived. For inclusion in the study, the model had to contain 
variables that were published regularly or that could easily 
be added to existing tables. They are the estimated totals of
the individual cells ($T_{ij}$), the row totals ($T_{i.}$), the column
totals ($T_{.j}$), the grand total ($T_{..}$), and the sampling errors of
the row and column totals, $SE(T_{i.})$ and $SE(T_{.j})$. The "."
signifies summation over that row or column subscript.
The sampling errors for the row and column totals
could be printed as an additional column and row in the
table. Hence, a model that utilized the row and column
sampling errors was included in this study.


The models investigated were: 


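$$SE(T_{ij}) = SE(T_{..})\,(T_{..}/T_{ij})^{0.5} \tag{1}$$
$$SE(T_{ij}) = (b_0 + b_1 (1/T_{ij}))^{0.5} \tag{2}$$
$$SE(T_{ij}) = b_0 + b_1 (1/T_{ij})^{0.5} \tag{3}$$
$$SE(T_{ij}) = b_1\, T_{ij}^{b_2}\, T_{i.}^{b_3}\, T_{.j}^{b_4} \tag{4}$$
$$SE(T_{ij}) = b_0 + b_1\, SE(T_{i.})\,(T_{i.}/T_{ij})^{0.5} + b_2\, SE(T_{.j})\,(T_{.j}/T_{ij})^{0.5} \tag{5}$$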


Model 1 is used by some FIA projects (Hines and
Vissage, 1988) and has no unknown parameters to be
estimated from the data in the table. It assumes that the
sampling error is inversely proportional to the square root of
the cell value. It is also a special case of (3) with $b_0 = 0$
and $b_1 = SE(T_{..})(T_{..})^{0.5}$.
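Model 1 needs only the grand total, its sampling error, and the cell estimate, so a reader can apply it directly. A minimal sketch follows, with hypothetical function names and invented numbers. It also checks numerically that Model 1 matches the Model 3 form when the intercept is zero and the slope equals the grand-total sampling error times the square root of the grand total.

```python
import math


def model1_se(cell_total, grand_total, se_grand_total):
    """Model 1: SE(T_ij) = SE(T_..) * (T_.. / T_ij) ** 0.5, in percent."""
    return se_grand_total * math.sqrt(grand_total / cell_total)


def model3_se(cell_total, b0, b1):
    """Model 3: SE(T_ij) = b0 + b1 * (1 / T_ij) ** 0.5, in percent."""
    return b0 + b1 * math.sqrt(1.0 / cell_total)


# Hypothetical table: grand total of 1,000,000 with a 2 percent sampling
# error, and a cell estimate of 10,000.
grand_total, se_grand = 1_000_000.0, 2.0
print(model1_se(10_000.0, grand_total, se_grand))  # 20.0

# Same prediction from Model 3 with b0 = 0 and b1 = SE(T_..) * sqrt(T_..).
b1 = se_grand * math.sqrt(grand_total)
assert math.isclose(model3_se(10_000.0, 0.0, b1), 20.0)
```

In this invented case the cell is one percent of the grand total, and its predicted sampling error is ten times that of the grand total.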


Model 2 has been used in large surveys by the Current 
Population Survey and the National Health Interview Survey 
in a related form (Valliant, 1987). Model 3 is the general
form of (1). It allows for an intercept and does not force $b_1$
to equal a specific value but allows the data to determine
the slope coefficient.
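Because Model 3 is linear in the inverse square root of the cell estimate, its two coefficients can be estimated by ordinary least squares. The sketch below is one way to do that in plain Python. The function name and the data are hypothetical, and the note does not say what software was actually used.

```python
def fit_model3(cell_totals, cell_ses):
    """Least-squares fit of SE = b0 + b1 * T ** -0.5; returns (b0, b1)."""
    xs = [t ** -0.5 for t in cell_totals]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(cell_ses) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, cell_ses))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Hypothetical cell estimates and their sampling errors in percent.
totals = [100.0, 400.0, 2500.0, 10000.0]
ses = [33.0, 15.0, 8.0, 4.0]
b0, b1 = fit_model3(totals, ses)
print(round(b0, 3), round(b1, 3))
```

A reader given the published $b_0$ and $b_1$ would then predict the sampling error of any cell as $b_0 + b_1 T_{ij}^{-0.5}$.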


Model 4 was derived to account for nonlinear relationships
between the variables $T_{ij}$, $T_{i.}$, and $T_{.j}$ and the dependent
variable $SE(T_{ij})$. Replacing $b_2$ with -0.5 and eliminating
$T_{i.}$ and $T_{.j}$, the model becomes:

$$SE(T_{ij}) = b_1\, T_{ij}^{-0.5} \tag{6}$$

and is identical to Model 3 if $b_0$ is equal to zero.
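Model 4 is a product of powers of the cell estimate and its row and column totals, so taking logarithms makes it linear in the exponents. The sketch below uses that log-linear shortcut. This is a substitute technique, not the nonlinear fit described in this note, and because it minimizes error on the log scale it will generally give somewhat different coefficients. The function names, the coefficient labels (a multiplier `b1` and exponents `b2`, `b3`, `b4`), and the input layout are assumptions made for illustration.

```python
import math


def ols(rows, ys):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * y for r, y in zip(rows, ys))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


def fit_model4_loglinear(cells):
    """Fit SE = b1 * T_ij**b2 * T_i.**b3 * T_.j**b4 on the log scale.

    cells: list of (cell_total, row_total, col_total, se_percent) tuples,
    all strictly positive. Returns (b1, b2, b3, b4).
    """
    rows = [[1.0, math.log(t), math.log(ti), math.log(tj)]
            for t, ti, tj, _ in cells]
    ys = [math.log(se) for *_, se in cells]
    ln_b1, b2, b3, b4 = ols(rows, ys)
    return math.exp(ln_b1), b2, b3, b4
```

The log-linear estimates could also serve as starting values for a nonlinear routine, which is one of the implementation burdens the note mentions for nonlinear models.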


Model 5 was derived to incorporate as much information as 
possible about the variability of the cell sampling errors 
without printing all of the sampling errors. It uses 
information from the sum of the rows and columns as well 
as the marginal sampling errors. The sampling errors of the 
row and column totals must be published for this model to 
be used. 
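From the reader's side, applying Model 5 means combining the published coefficients with the row and column totals and their sampling errors. The sketch below assumes the Model 5 form is an intercept plus one scaled term each for the row and column sampling errors. The function name, the coefficient values, and the table values are all hypothetical.

```python
import math


def model5_se(cell, row_total, col_total, se_row, se_col, b0, b1, b2):
    """Model 5: b0 + b1*SE(T_i.)*sqrt(T_i./T_ij) + b2*SE(T_.j)*sqrt(T_.j/T_ij)."""
    row_term = se_row * math.sqrt(row_total / cell)
    col_term = se_col * math.sqrt(col_total / cell)
    return b0 + b1 * row_term + b2 * col_term


# Hypothetical cell of 100 in a row totaling 400 (SE 5 percent) and a
# column totaling 900 (SE 4 percent), with invented coefficients.
print(model5_se(100.0, 400.0, 900.0, 5.0, 4.0, b0=0.5, b1=0.5, b2=0.25))  # 8.5
```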


Tests 


The models were evaluated on a portion of the tables 
produced by the Northeastern Forest Experiment Station’s 
FIA unit for the 1989 Kentucky inventory. Fifteen tables 
were divided into two types: those that contained 
continuous variables such as board-foot and cubic-foot 
volumes and those that were from multinomial distributions. 
An example of a multinomial variable is forest ownership in
acres. Each plot falls in exactly one of several ownership
categories. The weighted proportion of plot
occurrences in a category is multiplied by the total acreage
in that county (assumed not to have error). Although
acreage is usually considered a continuous variable, here it is
treated as a constant. Other area tables in this study were
derived in a similar manner.


The coefficient of determination ($R^2$) was the measure of
performance. For Model 1 it was calculated as if it were a
linear regression model; that is:

$$y_{ij} = SE(T_{..})\,(T_{..}/T_{ij})^{0.5}$$
$$R^2 = 1 - \mathrm{var}(SE(T_{ij}) - y_{ij}) / \mathrm{var}(SE(T_{ij}))$$
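The Model 1 calculation can be sketched as follows: predict each cell's sampling error from the grand total, then compare the variance of the prediction errors with the variance of the observed sampling errors. The function name and inputs are hypothetical. Note that the ratio can exceed one, so the result can be negative when the predictions add more error than they explain.

```python
import math
import statistics


def model1_r_squared(cell_totals, cell_ses, grand_total, se_grand_total):
    """R^2 = 1 - var(SE_ij - y_ij) / var(SE_ij), with y_ij from Model 1."""
    preds = [se_grand_total * math.sqrt(grand_total / t) for t in cell_totals]
    residuals = [se - y for se, y in zip(cell_ses, preds)]
    return 1.0 - statistics.variance(residuals) / statistics.variance(cell_ses)


# Hypothetical cells: Model 1 predicts 8, 4, and 2 percent for these totals.
totals = [100.0, 400.0, 1600.0]
observed = [9.0, 4.0, 1.0]
print(model1_r_squared(totals, observed, grand_total=1600.0, se_grand_total=2.0))
```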


Similarly, the $R^2$ values for the nonlinear models were:

$$R_i^2 = 1 - SS_{corrected}/MSE, \qquad i = 2, 4$$

where SS is the sum of squares and MSE is the mean
square error.


Sampling errors greater than or equal to 100 percent were 
eliminated from the regressions since they represent cells 
with a single non-zero observation. However, we believe 
that the models can be applied to these cells, resulting in 
an improved estimate of the variance. 


Results


The proportion of variation explained, $R^2$, for each of the
five models is shown in Table 1. Model 1 consistently
explained less variation than the other models, and actually
introduced more error than it explained for the continuous
tables; for these, Table 1 shows $R^2$ values of zero.


Model 2 performed well with the area tables, producing $R^2$
values between 0.81 and 0.99. This is the only model for
which a theoretical basis has been shown to exist for
two-stage sampling used in population surveys (Valliant, 1987).
Valliant also shows how this model should do well for 
continuous variables exhibiting gamma distribution of X(b,1) 
where b and 1 are population parameters, so long as the 
variables meet certain conditions. One condition is that the 
second parameter must be constant across all strata in the 
population, an improbability with tables composed of two or 
more sampling intensities. 


Consequently, Model 2 did not perform well with the other
variables. The $R^2$ values dropped to between 0.49 and 0.84
for the board-foot, cubic-foot, and number-of-stems tables.


With one exception, Model 3 performed as well as or better
than Model 2 and equaled Model 4 for several of the tables.
Its lowest $R^2$ was 0.87 for the area tables, but Model 4 was
considerably better for tables involving number of stems
per acre.


The range of $R^2$ values for the area tables with Model 4 was
0.90 to 0.99. It dropped to 0.67 to 0.92 for the continuous
tables. The overall performance of Model 4 was second only to
Model 5. For the area tables, the estimate of the coefficient
$b_2$ ranged from -0.5 to -0.6, and $b_3$ and $b_4$ were either not
significantly different from zero or barely significant at the
0.05 level.


Table 1. Proportion of variation explained, $R^2$, for each of the five models applied
to 15 different tables from the 1989 Kentucky forest survey

                                                           $R^2$ for Model
Attribute                  Rows         Columns          1     2     3     4     5
Forest area                County       Owner class      0.40  0.81  0.89  0.90  0.96
Forest area                County       Forest type      0.43  0.87  0.93  0.95  0.98
Forest area                County       Stand size       0.41  0.81  0.87  0.90  0.93
Forest area                County       Site class       0.42  0.86  0.92  0.92  0.97
Forest area                County       Stocking class   0.41  0.85  0.92  0.92  0.98
Forest area                Forest type  Owner class      0.50  0.98  0.98  0.98  0.97
Forest area                Owner class  Stocking class   0.49  0.99  0.99  0.99  0.99
Forest area                Forest type  Stand size       0.48  0.99  0.97  0.98  0.98
Net cubic-foot volume      Species      Diameter class   0.00  0.76  0.78  0.82  0.84
Cubic-foot vol. in sawlog  Species      Diameter class   0.00  0.77  0.79  0.83  0.83
Board-foot volume          Species      Diameter class   0.00  0.72  0.74  0.78  0.84
Net cubic-foot volume      Tree class   Species group    0.00  0.84  0.87  0.92  0.88
Board-foot volume          Species      Log grade        0.00  0.66  0.72  0.79  0.81
No. live trees             Species      Diameter class   0.00  0.49  0.53  0.69  0.88
No. growing-stock trees    Species      Diameter class   0.00  0.49  0.53  0.67  0.87


Model 5 had the best overall performance, with $R^2$ values of
0.93 to 0.99 for the area tables and 0.81 to 0.88 for the
remaining tables. However, it performed substantially better
than Models 3 and 4 only for the two tables involving
number of trees per acre. The marginally better performance
was attributed to the additional information in the model
from the sampling errors of the row and column totals.


Many tables in forestry are constructed using more than one
sampling intensity. This alters the relationship of cell
estimate to cell variance, which is the basis for generalized
variance functions. Most of the models were able to predict
the estimated cell variances well for the area tables. The
area tables are derived from information on the entire plot 
with each attribute recorded only once per plot. This is in 
contrast to the other tables where data are recorded on 
individual trees in various combinations of plot designs. The 
problem is especially acute when the table is separated into 
diameter classes. When more than one fixed plot size is 
used, the decision whether to record information from a tree 
is based on the tree’s distance from the plot center and the 
diameter. So the diameter classes in a table also
correspond to plot size. The relationship between the cell
totals and the sampling errors is not a "smooth curve" but
actually two or more curves because sampling error
increases as plot size decreases.


The problem of estimating a table composed of more than
one fixed plot size becomes apparent when the table is
divided into diameter classes. The table for number of live
trees per acre contains three fixed plot sizes and the
columns are divided into diameter classes. Deciding which
plot a tree belongs to depends on the diameter of the stem
as well as its distance from the plot center, so the column
headings divide the table by plot size as well as by diameter
class. Three separate equations, one for each plot size,
were fitted to the table using Model 3; the $R^2$ improved from
0.53 to 0.82, indicating that the GVF differs for each of the
three diameter-class groups.
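Fitting one equation per plot size amounts to grouping the cells by the plot size that produced them and fitting Model 3 within each group. The sketch below assumes each cell carries a group label, for example a diameter-class group. The function names, labels, and data are hypothetical.

```python
from collections import defaultdict


def fit_line(xs, ys):
    """Simple least-squares line; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_model3_by_group(cells):
    """Fit SE = b0 + b1 * T ** -0.5 separately for each group label.

    cells: iterable of (group_label, cell_total, se_percent).
    Returns {group_label: (b0, b1)}.
    """
    grouped = defaultdict(list)
    for group, total, se in cells:
        grouped[group].append((total ** -0.5, se))
    fits = {}
    for group, points in grouped.items():
        xs = [x for x, _ in points]
        ys = [y for _, y in points]
        fits[group] = fit_line(xs, ys)
    return fits


# Hypothetical cells from two plot sizes with visibly different curves.
cells = [
    ("small plot", 100.0, 62.0), ("small plot", 400.0, 31.0),
    ("small plot", 1600.0, 17.0),
    ("large plot", 100.0, 21.0), ("large plot", 400.0, 11.0),
    ("large plot", 1600.0, 6.0),
]
for label, (b0, b1) in fit_model3_by_group(cells).items():
    print(label, round(b0, 2), round(b1, 2))
```

Publishing one coefficient pair per group costs the reader an extra lookup, which is the trade-off against the improved fit reported above.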


A model's ability to predict the sampling errors of "mixed"
sampling intensities depends on the amount of information
about the table within the model and its degree of flexibility.
Model 5 included information concerning the sampling
errors of $T_{i.}$ and $T_{.j}$ and consequently performed better than
the other models. Model 4 was second because of its ability
to adjust the power of the exponents to better estimate a
compromise coefficient through the collection of points
produced by the different sampling intensities.


Conclusion 


Models 2 through 5 performed well for the area tables. If 
their performance can be classified as about equal, then 
two factors remain in deciding which model to use. The first
is ease of use for the reader: the fewer coefficients to
multiply and the fewer variables to locate in the table, the
better. The second factor is ease in
implementing the regression procedure into existing 
software and any maintenance of the procedure that may be 
required by the producers of the tables. Nonlinear models
are more difficult to implement because initial estimates of
the parameters must be supplied for each table or group of
tables, and the initial values may need to change from
one data set to another. Changing the estimates can
become burdensome, especially when many tables are
produced as a part of a larger program requiring
considerable computer resources. On the basis of these
factors, Model 3 is recommended. 


For tables constructed from more than one sampling
intensity, Model 5 is recommended if $SE(T_{i.})$ and $SE(T_{.j})$ are
published. Otherwise, Model 4 is recommended so long as
the problems of implementing a nonlinear subroutine are
not considered significant. Model 3 should be considered
if Models 4 and 5 are deemed unsuitable.


Including generalized variance functions with tables has 
been shown to be feasible even with tables composed of 
varying sampling intensities. GVFs save costs compared
with publishing separate tables of sampling errors, add
utility to publications that currently do not publish
sampling errors for all tables, and maintain a high degree
of readability in the final product.